Efficient instruction fusion by fusing instructions that fall within a counter-tracked amount of cycles apart

ABSTRACT

A technique to enable efficient instruction fusion within a computer system. In one embodiment, a processor logic delays the processing of a second instruction for a threshold amount of time if a first instruction within an instruction queue is fusible with the second instruction.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/290,395, filed Oct. 30, 2008, entitled “Delayed Processing Techniques For Improving Efficient Instructions Fusion”, all of which is herein incorporated by reference.

FIELD OF THE INVENTION

Embodiments of the invention relate generally to the field of information processing and more specifically, to the field of instruction fusion in computing systems and microprocessors.

BACKGROUND

Instruction fusion is a process that combines two instructions into a single instruction which results in a one-operation (or micro-operation, “uop”) sequence within a processor. Instructions stored in a processor instruction queue (IQ) may be “fused” after being read out of the IQ and before being sent to instruction decoders or after being decoded by the instruction decoders. Typically, instruction fusion occurring before the instruction is decoded is referred to as “macro-fusion”, whereas instruction fusion occurring after the instruction is decoded (into uops, for example) is referred to as “micro-fusion”. An example of macro-fusion is the combining of a compare (“CMP”) instruction or test instruction (“TEST”) (“CMP/TEST”) with a conditional jump (“JCC”) instruction. CMP/TEST and JCC instruction pairs may occur regularly in programs at the end of loops, for example, where a comparison is made and, based on the outcome of a comparison, a branch is taken or not taken. Since macro-fusion may effectively increase instruction throughput, it may be desirable to find as many opportunities to fuse instructions as possible.

For instruction fusion opportunities to be found in some prior art processor microarchitectures, both the CMP/TEST and JCC instructions may need to reside in the IQ concurrently so that they can be fused when the instructions are read from the IQ. However, if there is a fusible CMP/TEST instruction in the IQ and no further instructions have been written to the IQ (i.e. the CMP/TEST instruction is the last instruction in the IQ), the CMP/TEST instruction may be read from the IQ and sent to the decoder without being fused, even if the next instruction in program order is a JCC instruction. An example where a missed fusion opportunity may occur is if the CMP/TEST and the JCC happen to be across a storage boundary (e.g., 16 byte boundary), causing the CMP/TEST to be written in the IQ in one cycle and the JCC to be written the following cycle. In this case, if there are no stalling conditions, the JCC will be written in the IQ at the same time or after the CMP/TEST is being read from the IQ, so a fusion opportunity will be missed, resulting in multiple unnecessary reads of the IQ, reduced instruction throughput and excessive power consumption.
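By way of illustration only, the storage-boundary condition described above may be sketched in Python as follows. The function names, the byte addresses and the instruction lengths are hypothetical; the sketch further assumes that one aligned block is written to the IQ per cycle and that an instruction is written once its last byte has been fetched, neither of which is required by any particular microarchitecture.

```python
FETCH_BYTES = 16  # example storage boundary from the text; other sizes are possible


def fetch_block(addr: int, block: int = FETCH_BYTES) -> int:
    """Index of the aligned block that contains byte address `addr`."""
    return addr // block


def written_in_different_cycles(cmp_addr: int, cmp_len: int, jcc_len: int) -> bool:
    """True if the JCC following the CMP completes in a later block than the CMP.

    Illustrative assumption: the JCC starts at the byte immediately after
    the CMP, and each instruction is available when its last byte is fetched.
    """
    cmp_last = cmp_addr + cmp_len - 1   # address of the last byte of the CMP
    jcc_last = cmp_last + jcc_len       # address of the last byte of the JCC
    return fetch_block(jcc_last) > fetch_block(cmp_last)
```

Under these assumptions, a 3-byte CMP at address 0x1C followed by a 2-byte JCC straddles the boundary at 0x20, so the pair would be written to the IQ in different cycles, whereas the same pair starting at 0x10 would not.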

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 illustrates a block diagram of a microprocessor, in which at least one embodiment of the invention may be used;

FIG. 2 illustrates a block diagram of a shared bus computer system, in which at least one embodiment of the invention may be used;

FIG. 3 illustrates a block diagram of a point-to-point interconnect computer system, in which at least one embodiment of the invention may be used;

FIG. 4 illustrates a block diagram of a state machine, which may be used to implement at least one embodiment of the invention;

FIG. 5 is a flow diagram of operations that may be used for performing at least one embodiment of the invention.

FIG. 6 is a flow diagram of operations that may be performed in at least one embodiment.

DETAILED DESCRIPTION

Embodiments of the invention may be used to improve instruction throughput in a processor and/or reduce power consumption of the processor. In one embodiment, what would be otherwise missed opportunities for instruction fusion are found and instruction fusion may occur as a result. In one embodiment, would-be missed instruction fusion opportunities are found by delaying reading of a last instruction from an instruction queue (IQ) or the issuance of the last instruction read from the IQ to a decoding phase for a threshold number of cycles, so that any subsequent fusible instructions may be fetched and stored in the IQ (or at least identified without necessarily being stored in the IQ) and subsequently fused with the last fusible instruction. In one embodiment, delaying the reading or issuance of a first fusible instruction by a threshold number of cycles may improve processor performance, since doing so may avoid two, otherwise fusible, instructions being decoded and processed separately rather than as a single instruction.

The choice of the threshold number of wait cycles may depend upon the microarchitecture in which a particular embodiment is used. For example, in one embodiment, the threshold number of cycles may be two, whereas in other embodiments, the threshold number of cycles may be more or less than two. In one embodiment, the threshold number of wait cycles provides the maximum amount of time to wait on a subsequent fusible instruction to be stored to the IQ while maintaining an overall latency/performance advantage in waiting for the subsequent fusible instruction over processing the fusible instructions as separate instructions. In other embodiments, where power is more critical, for example, the threshold number of wait cycles could be higher in order to ensure that extra power is not used to process the two fusible instructions separately, even though the number of wait cycles may cause a decrease (albeit temporarily) in instruction throughput.

FIG. 1 illustrates a microprocessor in which at least one embodiment of the invention may be used. In particular, FIG. 1 illustrates microprocessor 100 having one or more processor cores 105 and 110, each having associated therewith a local cache 107 and 113, respectively. Also illustrated in FIG. 1 is a shared cache memory 115 which may store versions of at least some of the information stored in each of the local caches 107 and 113. In some embodiments, microprocessor 100 may also include other logic not shown in FIG. 1, such as an integrated memory controller, integrated graphics controller, as well as other logic to perform other functions within a computer system, such as I/O control. In one embodiment, each microprocessor in a multi-processor system or each processor core in a multi-core processor may include or otherwise be associated with logic 119 to enable interrupt communication techniques, in accordance with at least one embodiment. The logic may include circuits, software or both to enable more efficient fusion of instructions than in some prior art implementations.

In one embodiment, logic 119 may include logic to reduce the likelihood of missing instruction fusion opportunities. In one embodiment, logic 119 delays the reading of a first instruction (e.g., CMP) from the IQ, when there is no subsequent instruction stored in the IQ or other fetched instruction storage structure. In one embodiment, the logic 119 delays the reading or issuance of a first fusible instruction for a threshold number of cycles (e.g., two cycles) before reading the IQ or issuing the first fusible instruction to a decoder or other processing logic, such that if there is a second fusible instruction that can be fused with the first instruction not yet stored in the IQ (due, for example, to the two fusible instructions being stored in a memory or cache in different storage boundaries), the opportunity to fuse the two fusible instructions may not be missed. In some embodiments, the threshold may be fixed, whereas in other embodiments, the threshold may be variable, modifiable by a user or according to a user-independent algorithm. In one embodiment, the first fusible instruction is a CMP instruction and the second fusible instruction is a JCC instruction. In other embodiments, either or both of the first and second instruction may not be a CMP or JCC instruction, but any fusible instructions. Moreover, embodiments of the invention may be applied to fusing more than two instructions.

FIG. 2, for example, illustrates a front-side-bus (FSB) computer system in which one embodiment of the invention may be used. Any processor 201, 205, 210, or 215 may access information from any local level one (L1) cache memory 220, 225, 230, 235, 240, 245, 250, 255 within or otherwise associated with one of the processor cores 223, 227, 233, 237, 243, 247, 253, 257. Furthermore, any processor 201, 205, 210, or 215 may access information from any one of the shared level two (L2) caches 203, 207, 213, 217 or from system memory 260 via chipset 265. One or more of the processors in FIG. 2 may include or otherwise be associated with logic 219 to enable improved efficiency of instruction fusion, in accordance with at least one embodiment.

In addition to the FSB computer system illustrated in FIG. 2, other system configurations may be used in conjunction with various embodiments of the invention, including point-to-point (P2P) interconnect systems and ring interconnect systems. The P2P system of FIG. 3, for example, may include several processors, of which only two, processors 370, 380, are shown by example. Processors 370, 380 may each include a local memory controller hub (MCH) 372, 382 to connect with memory 32, 34. Processors 370, 380 may exchange data via a point-to-point (PtP) interface 350 using PtP interface circuits 378, 388. Processors 370, 380 may each exchange data with a chipset 390 via individual PtP interfaces 352, 354 using point to point interface circuits 376, 394, 386, 398. Chipset 390 may also exchange data with a high-performance graphics circuit 338 via a high-performance graphics interface 339. Embodiments of the invention may be located within any processor having any number of processing cores, or within each of the PtP bus agents of FIG. 3. In one embodiment, any processor core may include or otherwise be associated with a local cache memory (not shown). Furthermore, a shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via p2p interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode. One or more of the processors or cores in FIG. 3 may include or otherwise be associated with logic 319 to enable improved efficiency of instruction fusion, in accordance with at least one embodiment.

In at least one embodiment, a second fusible instruction may not be stored into an IQ before some intermediate operation occurs (occurring between a first and second fusible instruction), such as an IQ clear operation, causing a missed opportunity to fuse the two otherwise fusible instructions. In one embodiment, in which a cache (or a buffer) stores related sequences of decoded instructions (after they were read from the IQ and decoded) or uops (e.g., “decoded stream buffer” or “DSB”, “trace cache”, or “TC”) that are to be scheduled (perhaps multiple times) for execution by the processor, a first fusible uop (e.g., CMP) may be stored in the cache without a fusible second uop (e.g., JCC) within the same addressable range (e.g., same cache way). This may occur, for example, where JCC is crossing a cache line (due to a cache miss) or crossing a page boundary (due to a translation look-aside buffer miss), in which case the cache may store the CMP without the JCC. Subsequently, if the processor core pipeline is cleared (due to a “clear” signal being asserted, for example) after the CMP was stored but before the JCC is stored in the cache, the cache may store only the CMP in one of its ways without the JCC.

On subsequent lookups to the cache line storing the CMP, the cache may interpret the missing JCC as a missed access and the JCC may be marked as the append point for the next cache fill operation. This append point, however, may not be found since the CMP+JCC may be read as fused from the IQ. Therefore, the requested JCC may not match any uop to be filled, coming from the IQ, and thus the cache will not be able to fill the missing JCC, but may continually miss on the line in which the fused CMP+JCC is expected. Moreover, in one embodiment in which a pending fill request queue (PFRQ) is used to store uop cache fill requests, an entry that was reserved for a particular fused instruction fill may not deallocate (since the expected fused instruction fill never takes place) and may remain useless until the next clear operation. In one embodiment, a PFRQ entry lock may occur every time the missing fused instruction entry is accessed, and may therefore prevent any subsequent fills to the same location.

In order to prevent an incorrect or undesirable lock of the PFRQ entry, a state machine, in one embodiment, may be used to monitor the uops being read from the IQ to detect cases in which a region that has a corresponding PFRQ entry (e.g., a region marked for a fill) was completely missed, due, for example, to the entry's last uop being reached without the fill start point being detected. In one embodiment, the state machine may cause the PFRQ entry to be deallocated when this condition is met. In other embodiments, an undesirable PFRQ entry lock may be avoided by not creating within the cache a fusible instruction that may be read from the IQ without both fusible instructions present. For example, if a CMP is followed by a non-JCC instruction, a fused instruction entry may be created in the cache, but only if the CMP is read out of the IQ alone (after the threshold wait time expires, for example) is the fused instruction entry not filled to the cache. In other embodiments, the number of times the state machine has detected a fill region that was skipped may be counted, and a cache flush or invalidation operation may be performed after some threshold count of times the fill region was skipped. The fill region may then be removed from the cache, and the fused instruction may then be re-filled.

FIG. 4 illustrates a state machine, according to one embodiment, that may be used to avoid unwanted PFRQ entry lock conditions due to a missed fusible instruction in the IQ. At state 401, in which the instructions in the IQ are not in a region marked for fill, a “fill region start” signal 405 may indicate that the IQ is about to process an instruction that is mapped to a fill region (an instruction from the fill region according to the cache hashing) but does not start at the linear instruction pointer saved in the PFRQ (“lip”). This may cause the state machine to move to state 410. If the next instruction in the IQ (that will soon be decoded) ends a fill region (e.g. ends a line as hashed by the cache, or is a taken branch), then the state machine causes the deallocation 415 of the corresponding PFRQ entry and the state machine returns to state 401. If, however, the fill pointer is equal to the fill region lip 430 while either in state 401 or state 410, the state machine enters state 420, in which the access is within the fill region and after the fill start point. From state 420, a last uop in the fill region indication will return 425 the state machine to state 401 without deallocating the corresponding PFRQ entry. The state machine of FIG. 4 may be implemented in hardware logic, software, or some combination thereof. In other embodiments, other state machines or logic may be used.
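By way of a non-limiting software sketch, the transitions of FIG. 4 described above may be modeled in Python as follows. The state names, the event strings ("region_start", "lip_match", "region_end") and the function name are hypothetical labels introduced only for this illustration, and the choice to ignore any event not described above is an assumption rather than a feature of any embodiment.

```python
from enum import Enum


class State(Enum):
    OUTSIDE = 401       # not in a region marked for fill
    MISSED_START = 410  # in a fill region, fill start point not seen
    FILLING = 420       # in a fill region, after the fill start point


def step(state: State, event: str):
    """Return (next_state, deallocate_pfrq_entry) for a single event."""
    if event == "lip_match" and state in (State.OUTSIDE, State.MISSED_START):
        return State.FILLING, False       # 430: fill pointer equals the lip
    if event == "region_start" and state is State.OUTSIDE:
        return State.MISSED_START, False  # 405: region entered away from the lip
    if event == "region_end" and state is State.MISSED_START:
        return State.OUTSIDE, True        # 415: region was skipped, deallocate
    if event == "region_end" and state is State.FILLING:
        return State.OUTSIDE, False       # 425: fill took place, keep the entry
    return state, False                   # assumption: other events are ignored
```

In this sketch, the PFRQ entry is deallocated only when a fill region is entered and left without the fill start point ever being matched, which corresponds to the skipped-region case described above.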

FIG. 5 illustrates a flow diagram of operations that may be used in conjunction with at least one embodiment of the invention. At operation 501, it is determined whether the currently accessed instruction in the IQ is fusible with any subsequent instruction. If not, then at operation 505, the next instruction is accessed from the IQ and the delay count is reset. If so, then at operation 510, a delay counter is incremented and at operation 515 it is determined whether the delay count threshold is reached. If it isn't, then at operation 520, no instruction fusion of the currently accessed instruction is performed. If it is, then the next instruction is accessed from the IQ and the delay count is reset at operation 505. In other embodiments, other operations may be performed to improve the efficiency of instruction fusion.
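For illustration only, one pass through the operations of FIG. 5 as recited above may be sketched in Python as follows. The function name and its parameters are hypothetical, and the sketch assumes that the threshold is "reached" when the incremented delay count is greater than or equal to the threshold; other comparisons may be used in other embodiments.

```python
def fig5_step(fusible: bool, delay_count: int, threshold: int):
    """One pass through the flow; returns (operation_reached, new_delay_count)."""
    if not fusible:               # operation 501, "no" branch
        return 505, 0             # access the next instruction, reset the count
    delay_count += 1              # operation 510: increment the delay counter
    if delay_count < threshold:   # operation 515: threshold not yet reached
        return 520, delay_count   # no fusion of the current instruction this pass
    return 505, 0                 # threshold reached: move on and reset the count
```

With a threshold of two, for example, a fusible instruction would reach operation 520 on the first pass and operation 505 on the second pass under the stated assumption.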

FIG. 6 illustrates a flow diagram of operations that may be performed in conjunction with at least one embodiment. In order to perform one embodiment in processors having a number of decoder circuits, it may be helpful to ensure that the first fusible instruction is to be decoded on a particular decoder circuit, which is capable of decoding the fused instruction. In FIG. 6, it is determined whether a particular instruction can be a first of a fused pair of instructions at operation 601. If not, then the fused instructions are issued at operation 605. If so, then it is determined whether the first fusible instruction is followed by a valid instruction in the IQ at operation 610. If so, then the fused instructions are issued at operation 610. If not, then at operation 615, it is determined whether the first fusible instruction is to be issued to a decoder capable of supporting the fused instruction. In one embodiment, decoder-0 is capable of decoding the fused instructions. If the first fusible instruction was not issued to decoder-0, then at operation 620, the first fusible instruction is moved, or “nuked”, to a different decoder until it corresponds to decoder-0. At operation 625, a counter is set to an initial value, N, and at operation 630, if the instruction is followed by a valid instruction or the counter is zero, then the fused instructions are issued at operation 635. Otherwise, at operation 640, the counter is decremented and the invalid instruction is nuked. In other embodiments, the counter may increment to a final value. In other embodiments, other operations, besides a nuke operation, may clear the invalid instruction.
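As a non-limiting sketch, the countdown of operations 625 through 640 described above may be modeled in Python as follows. The function name, the parameter `valid_after` (the number of cycles until a valid instruction follows the first fusible instruction, or None if none arrives) and the returned tuple are hypothetical and introduced only for illustration; the decoder-0 steering of operations 615 and 620 is not modeled.

```python
from typing import Optional, Tuple


def fig6_wait(n: int, valid_after: Optional[int]) -> Tuple[int, bool]:
    """Return (cycles_waited, followed_by_valid) at the point of issue."""
    counter = n                       # operation 625: set counter to N
    cycle = 0
    while True:
        followed = valid_after is not None and cycle >= valid_after
        if followed or counter == 0:  # operation 630: valid follower or timeout
            return cycle, followed    # operation 635: issue
        counter -= 1                  # operation 640: decrement the counter
        cycle += 1                    # (the invalid slot is cleared each pass)
```

For N equal to two, a following instruction that becomes valid after one cycle is caught after a one-cycle wait, whereas if no valid instruction arrives the first instruction is issued alone after two cycles.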

One or more aspects of at least one embodiment may be implemented by representative data stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores”, may be stored on a tangible, machine readable medium (“tape”) and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Thus, a method and apparatus for directing micro-architectural memory region accesses has been described. It is to be understood that the above description is intended to be illustrative and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A system comprising: a plurality of cores to process instructions; a memory controller to connect the system to a system memory; one of the cores comprising: an instruction memory to store instructions; a decoder to decode instructions; an instruction fusion circuit to fuse a first instruction and a second instruction to form a fused instruction to be processed by the core as a single instruction; and the instruction fusion circuit to fuse the first and second instructions when both the first and second instructions have been stored in the instruction memory prior to issuance.
 2. The system as in claim 1 further comprising: a cache shared by two or more of the plurality of cores.
 3. The system as in claim 1 further comprising: a system interconnect to communicatively couple the system to one or more other components.
 4. A method comprising: processing instructions on a plurality of cores; connecting a system to a system memory; storing the instructions in an instruction memory; decoding the instructions; fusing a first instruction and a second instruction to form a fused instruction to be processed by the core as a single instruction; and fusing the first and second instructions when both the first and second instructions have been stored in the instruction memory prior to issuance.
 5. A system comprising: means for processing instructions on a plurality of cores; means for connecting a system to a system memory; means for storing the instructions in an instruction memory; means for decoding the instructions; means for fusing a first instruction and a second instruction to form a fused instruction to be processed by the core as a single instruction; and means for fusing the first and second instructions when both the first and second instructions have been stored in the instruction memory prior to issuance.
 6. The system as in claim 5 further comprising: means for communicatively coupling the system to one or more other components.